Executive Summary

Purpose.   Our  goal  in  undertaking  this  project  was  to  develop 
a  model  describing  the  various  costs  of  handling  data  in  a  network 
setting.   This  model  could  then  be  used  to  study  the  cost  effectiveness 
of  various  data  distribution  strategies  and  to  identify  the  most  important 
sources  of  cost. 

The  model.   We  began  by  searching  the  literature  for  studies 
which  could  form  a  good  starting  point  for  our  work.   The  existing  model 
which  seemed  closest  to  what  we  needed  is  one  developed  by  a  group  at 
IBM  Research  (San  Jose).   This  model  [Lum  et  al. ,  1975]  describes  a  data 
staging  process  in  a  hierarchical  memory.   That  is,  the  data  is  assumed 
to be stored on a slow, cheap storage device when not in use and transferred
to a rapid, expensive device for accessing. What attracted us to
this  model  was  its  fineness  of  detail  and  the  ease  with  which  we  felt  we 
could  extend  it  to  a  network  situation  by  including  (as  part  of  the 
hierarchy)  devices  at  a  remote  site. 

Deficiencies  in  the  model.   We  have  identified  several  problems 
with  the  IBM  group's  approach.   First,  they  implicitly  assume  a  very  low 
rate  of  data  access.   Costs  which  may  in  fact  grow  very  rapidly  with 
increased  load  are  assumed  to  be  proportional  to  the  number  of  accesses 
or  to  the  amount  of  data  handled.   Second,  they  include  a  number  of 
terms  which  represent  lost  CPU  time  induced  by  delays  in  accessing 
devices.   This  seems  to  represent  a  sort  of  primitive  effort  to  parcel 
out  the  cost  of  the  inevitable  CPU  idle  time  among  the  various  processes. 
At  the  same  time  they  omit  some  real  CPU  costs  which  are  incurred  in  the 
data  transfer  process  and  which  may  be  significant.   In  addition,  they 
assume  that  only  CPU  idle  time  adds  significant  costs  and  ignore  costs 
due  to  other  idle  equipment,  such  as  channels. 


In spite of our reservations, we decided to work initially with
their  model.   We  felt  that  the  questions  we  had  about  it  might  be  more 
easily  and  more  rationally  resolved  after  we  had  experimented  with  it 
and  better  understood  its  limitations,  as  well  as  its  good  points.   We 
therefore  began  with  Lum's  cost  formula,  with  its  terms  for  storage, 
data  transfer,  and  accessing,  and  added  network  terms  -  including  costs 
for  data  transfer  to,  from  and  over  the  network,  as  well  as  protocol 
costs.   In  adding  these  terms,  we  felt  that,  if  only  for  consistency,  we 
should  follow  the  spirit  of  Lum's  model.   Hence  the  extended  model  also 
has  terms  involving  costs  of  "lost"  CPU  idle  time. 

At  this  point,  we  have  not  introduced  important  complexities, 
such  as  a  provision  for  remote  processing.   This  is  a  critical  omission 
since  remote  processing  intuitively  seems  to  offer  the  greatest  benefits 
in  distributed  data  processing. 

At  its  current  level  of  development,  the  model  has  severe 
limitations.   The  questions  we  raised  with  respect  to  Lum's  model  carry 
over.   The  model  only  describes  data  staging  and  no  other  aspect  of 
distributed  data  management.   On  the  other  hand,  the  questionable  terms 
in  the  model  are  usually  small  enough  so  that  their  probable  inaccuracies 
are  unlikely  to  seriously  affect  the  kind  of  broad  conclusions  that  we 
want  to  draw. 

The  model  is  adequate  to  make  an  initial  study  of  the  key 
question:   Is  it  ever  more  economical  to  store  data  at  a  remote  site 
(instead  of  locally)  and  bring  it  over  the  net  when  needed?  We  have 
used  the  model  to  study  this  question.   We  believe  that  the  results  of 
the  study  have  validity  for  real  systems.   Improvements  in  the  model  are 
not  expected  to  change  our  conclusions  significantly. 


Conclusions. The main result of our study is that heterogeneity
is a necessary requirement for remote storage to be cost effective.
This conclusion is intuitively reasonable. Transferring data
over  the  network  must  cost  something  -  and  this  additional  cost  is 
inevitably  incurred  if  the  data  is  stored  at  a  remote  site.   Therefore 
the  remote  site  must  be  significantly  cheaper,  in  some  respect,  than 
the  local  site  in  order  to  offset  the  network  costs. 

There are several ways in which such heterogeneity may be
achieved:

1.  Excess  capacity.   That  is,  some  sites  may  be  less  heavily 
loaded  either  because  of  usage  patterns  or  because  of  system 
differences. 

2.  Inexpensive  storage.   Special  facilities,  such  as  the  ARPA 
Network  Data  Computer,  may  be  available  at  one  site. 

3.  Artificially-induced  heterogeneity.   This  may  be  achieved  by 
arbitrarily  setting  charging  rates  at  some  sites  so  that  they 
are  significantly  cheaper  than  at  other  sites. 

It should be emphasized that in most situations the cost
differential due to heterogeneity must be sizable - not small percentages,
but orders of magnitude. Only as the amount of data transported over the
network decreases do the network costs fall to the point where small
cost differentials can make remote storage economical.

The  interested  reader  will  find  an  extensive,  detailed  discussion 
of  these  cost  balances  in  this  document.   He  should  be  warned,  however, 
that  the  model  is  sufficiently  complex  (having  some  35  parameters)  that 
careful  study  is  required  to  gain  a  thorough  understanding  of  the  model 
and  the  detailed  results. 


Finally,  we  reiterate  that  the  work  described  here  is  an 
initial  effort.   We  have  now  identified  the  weaknesses  of  Lum's  model 
and  believe  that  we  can  proceed  to  build  a  model  which  more  closely 
describes  distributed  data  management.   In  particular,  we  believe  that 
it  is  a  straightforward  problem  to  extend  the  model  to  the  point  where 
it  can  be  used  meaningfully  in  research  on  front-ending  and  intelligent 
terminals. 


Introduction 

The advantages of distributing a data base in a network
environment have been discussed at length in various papers, panel
discussions, and bull sessions. But it has been somewhat difficult to
quantify these advantages, to investigate the various tradeoffs, and
to determine just how great the advantages are.

Several researchers have investigated the problem of optimally
allocating files in a network to achieve minimum cost ([Casey, 1972],
[Chu, 1973]). The intent of this paper is to try to gain some
understanding of where the major cost factors are incurred and under what
circumstances or strategies accessing a distributed file system is
worthwhile.

For many of the cost-related questions that arise in the
development of a distributed data base system (such as those concerned
with the costs of queries, updates, backup, recovery, etc.), the system
can at first be viewed as a storage hierarchy. That is, to a local
process or user submitting a query to a remote site, storage devices
at that site appear as further levels of the hierarchy. From this point
of view the network is another channel with some special cost
considerations. In this paper we develop and study this sort of simple storage
hierarchy  model  of  distributed  data  processing.   This  approach  will  allow 
us  to  investigate  the  tradeoffs  offered  by  various  strategies  without 
becoming  involved  in  the  complexity  of  deciding  which  remote  site  should 
be  chosen.   In  fact,  what  we  are  attempting  here  is  to  determine  what 
criteria  such  a  decision  might  be  based  on  and  the  degree  of  cost  control 
offered  by  each  criterion.   In  future  refinements  of  the  model,  we  plan 
to  include  effects  of  processing  data  at  the  remote  sites  in  order  to 
take  advantage  of  cheaper  computation  or  possible  parallelism. 


Previous  Work  on  Cost  Models  for  Computer  Systems 

Cost  is  both  a  very  vague  and  ambiguous  measure  of  system 
performance  and  a  very  important  one.   The  ambiguity  comes  about  through 
the  difficulty  of  assigning  dollar  costs  to  all  factors  of  interest. 
One  way,  of  course,  is  to  carry  out  experiments  -  i.e.,  to  run  test 
programs  at  various  sites  and  compare  the  bills  received.   This  method 
yields  cost  comparisons  which  are  heavily  dependent  on  the  pricing 
policies  of  the  various  sites  as  well  as  on  site  hardware  and  software. 
Untangling  all  of  these  factors  to  determine  what  a  set  of  cost  figures 
really  means  is  no  easy  task.   On  the  other  hand,  cost  is  very  important 
in  that  it  serves  as  an  overall  measure  of  system  resource  utilization. 
For  example,  by  assigning  costs  to  them,  such  diverse  factors  as  CPU 
time  and  storage  used  can  be  added  together.   In  short,  costs  are  a 
device  by  which  one  can  add  together  apples  and  oranges. 

Assignment  of  specific  costs  to  various  factors  is  of  importance 
to  the  model  user,  but  not  necessarily  to  the  model  builder.   The  latter 
can consider costs of various resources to be simply weighting
coefficients, which can be adjusted at will to reflect a specific environment.
It  may  be,  for  example,  that  no  real  money  changes  hands.   But  a  user 
may  still  wish  to  evaluate  a  certain  system  or  piece  of  software  by 
using  a  formula  which  weights  storage  (which  may  be  in  short  supply) 
much  more  heavily  than  CPU  time. 

Modeling  network  file  allocation.   Of  particular  relevance  to 
our  study  of  distributed  data  management  are  the  cost  analyses  developed 
for  the  network  file  allocation  problem.   A  good  example  of  such  an 
analysis is that given by Casey [1972]. The parameters in his model are

1. the cost ("mainly for storage") of locating the file at any
site,

2. the costs of transmitting a given amount of data between two
given sites (with the possibility that update and query
transactions may be transmitted at different costs),

3. the amount of update traffic emanating from each site, and

4. the amount of query traffic emanating from each site.

Given  values  for  these  parameters,  the  cost  of  a  particular  allocation 
is  readily  computed. 
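
As a minimal sketch of how such a cost might be computed from the four parameters above, consider the following. It assumes, as in the usual statement of the file allocation problem, that each query is sent to the cheapest site holding a copy and that each update must be sent to every copy; that routing rule is not spelled out in the text above, and the function and argument names are ours.

```python
def allocation_cost(copies, storage_cost, query_cost, update_cost,
                    query_traffic, update_traffic):
    """Cost of keeping copies of one file at the sites listed in `copies`.

    storage_cost[k]    -- parameter 1: cost of locating the file at site k
    query_cost[j][k]   -- parameter 2: cost of sending one query from j to k
    update_cost[j][k]  -- parameter 2: cost of sending one update from j to k
    update_traffic[j]  -- parameter 3: update traffic emanating from site j
    query_traffic[j]   -- parameter 4: query traffic emanating from site j
    """
    if not copies:
        raise ValueError("at least one site must hold the file")
    total = sum(storage_cost[k] for k in copies)
    for j in range(len(query_traffic)):
        # Assumption: a query is routed to the cheapest available copy.
        total += query_traffic[j] * min(query_cost[j][k] for k in copies)
        # Assumption: an update must reach every copy of the file.
        total += update_traffic[j] * sum(update_cost[j][k] for k in copies)
    return total


# Hypothetical two-site network, purely for illustration.
storage = [10, 4]
q_cost = [[0, 2], [2, 0]]
u_cost = [[0, 3], [3, 0]]
q_traffic = [5, 1]
u_traffic = [1, 1]
for sites in ({0}, {1}, {0, 1}):
    print(sorted(sites), allocation_cost(sites, storage, q_cost, u_cost,
                                         q_traffic, u_traffic))
```

Because the model is linear in the traffic, comparing candidate allocations amounts to evaluating this sum for each one.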

Casey  states  that  transmission  costs  may  be  "a  rather  complex 
monotonically  increasing  function"  of  traffic,  but  he  feels  that  his 
linear  model  is  a  good  first  approximation.   A  better  idea  of  transmission 
costs  would  require  a  model  which  goes  into  the  transmission  process  in 
some  detail  and  analyzes  the  various  cost  components  and  how  they  are 
affected  by  the  amount  of  network  traffic.   The  site  costs  might  also 
profit  from  a  detailed  breakdown;  note  that  Casey  remarks  that  factors 
other  than  storage  are  being  lumped  into  one  term.   It  is  important  to 
realize,  however,  that  for  file  allocation  Casey's  model  is  probably 
quite  adequate.   It  is  only  when  one  wishes  to  study  other  aspects  of 
data  distribution  -  backup  and  recovery  strategies,  say  -  that  more 
detail  is  needed. 

Modeling  storage  hierarchies.   Even  before  networks  existed, 
the  file  allocation  problem  was  of  importance.   The  question  arose  as  to 
where  one  should  place  a  given  file  in  a  storage  hierarchy  -  i.e.,  a  set 
of  memory  devices  of  varying  accessibility  (core,  disk,  tape,  etc.) 
connected to a single computer. A particularly comprehensive cost model
for this problem has appeared [Lum et al., 1975]. This model differentiates
between random and sequential forms of data access and includes
considerations of staging, channel costs, CPU overhead, etc. Because of its
completeness, we considered this model an appropriate one for extension
to the network case. That is, memory devices at a remote site may simply
be  considered  as  parts  of  the  storage  hierarchy,  provided  that  network 
costs  are  properly  taken  into  account.   A  detailed  discussion  of  the  model 
of  Lum  et  al.  appears  below. 

The  distributed  data  management  problem  is  of  course  far  more 
complex  than  the  storage  hierarchy  problem.   The  model  of  Lum  et  al.  (and 
this  extension  of  it)  assumes  that  all  data  processing  (updating  and 
responding  to  queries)  takes  place  in  local  core.   No  provision  exists 
for  sending  a  query  to  a  remote  site  for  processing.   Thus,  although  our 
straightforward  extension  of  Lum's  storage  hierarchy  model  has  provided 
some  insight  into  data  distribution,  it  is  grossly  inadequate  for  studying 
all  the  many  facets  of  distributed  data  management. 

In what follows we will first review the model described in
[Lum et al., 1975]. (In order to facilitate the discussion, this model
will be referred to henceforth as the LSWL model.) Next we will extend
the LSWL model to include a network. Then we will use the model along
with some relevant data to investigate the properties of the model and to
analyze some conditions and strategies under which remotely accessing
data may be useful. Finally, we will discuss future refinements and
further experiments that would be of interest.

A Review of the LSWL Model

Overview.   The  LSWL  model  primarily  addresses  the  problem  of 
"data  staging"  or  "data  migration".   In  other  words,  when  a  file  or  data 
set  is  not  being  used  (i.e.,  is  inactive)  it  is  stored  on  one  device 
(usually  a  relatively  slow,  inexpensive  one).   Then,  when  the  data  set 
is accessed, it is moved to a faster, more expensive device so that the
program will waste fewer resources waiting for data. The question we
are  concerned  with  here  is,  given  the  accessing  characteristics  (number 
of  reads  and  writes,  proportion  of  time  the  file  is  in  use,  etc.),  where 
in  a  given  hierarchy  should  the  data  set  be  stored  when  it  is  inactive 
and  where  should  it  be  moved  when  it  is  active? 

Lum  et  al.  develop  an  objective  function  which  gives  the  cost 
of  accessing  a  data  set  which  is  stored  on  one  device  when  inactive  and 
another  (possibly  the  same  device)  when  active.   In  this  model  the 
entire  data  set  is  moved  from  the  inactive  device  to  the  active  one. 
(We  shall  relax  this  requirement  in  our  model.) 

The  selection  algorithm  is  then  quite  straightforward.   The 
objective  function  is  evaluated  for  a  given  set  of  variables  for  each 
pair  of  devices  in  the  hierarchy.   The  lowest  cost  then  indicates  on 
which  pair  of  devices  the  data  should  be  located. 
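
The selection procedure just described can be sketched as an exhaustive search over device pairs. In this sketch the objective function is passed in as an argument; the toy cost function in the usage example is hypothetical and merely stands in for the real objective function, and the names are ours.

```python
from itertools import product


def best_placement(levels, cost):
    """Return (inactive_level, active_level, cost) with the lowest cost.

    `levels` is any sequence of level indices and `cost(i, j)` evaluates
    the objective function for inactive level i and active level j.
    The same device may serve as both, so pairs with i == j are included.
    """
    best = None
    for i, j in product(levels, repeat=2):
        c = cost(i, j)
        if best is None or c < best[2]:
            best = (i, j, c)
    return best


# Hypothetical stand-in for the objective function, for illustration only.
def toy_cost(i, j):
    return (i - 1) ** 2 + (j - 2) ** 2 + 0.5


print(best_placement(range(3), toy_cost))
```

With n levels the search evaluates the objective function n squared times, which is negligible for the small hierarchies considered here.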

Assumptions.   The  authors  make  several  simplifying  assumptions, 
most  of  which  can  be  relaxed  at  the  cost  of  a  more  complex  cost  function. 
They assume that, for data sets, system paging activity will not
significantly affect cost. However, it would probably be necessary to relax
this constraint if one wished to consider costs incurred by program
activity. They further assume that transfers are direct rather than
through core and that there are no flow control problems (i.e., a fast
device can always accept data from a slow device). It is also assumed
that transfers are not constrained by the capacity of the device the
data set is being moved to. These last two assumptions can both be
dropped at the cost of a more complex equation. As we shall see, when
we add a network to the hierarchy, flow control cannot be ignored.


The  authors  also  assume  that  the  data  is  only  staged  between 
two  levels,  and  that  multiple  staging  does  not  occur  (such  as  disk  pack 
to  bulk  memory  to  core,  as  might  happen  in  Multics). 

Although  for  the  most  part  we  carry  over  these  assumptions 
underlying  the  LSWL  model  to  our  analysis,  we  will  relax  the  assumption 
that  the  entire  data  set  is  staged.   This  will  allow  us  to  simulate  the 
ability  to  retrieve  only  that  part  of  the  data  required. 

The objective function. Now that we have reviewed the assumptions
behind this analysis, let us look at the cost function itself in
some  detail.   The  reader  should  consult  table  1  for  a  key  to  the  symbols 
used  and  figure  1  for  a  summary  of  the  objective  function. 

Figure 1
Objective Function for the LSWL Model

Data Set Characteristics:

$q$ = number of sequential block accesses.
$r$ = number of random block accesses.
$S'$ = total data set size.
$S$ = amount of data moved to the active level.
$s$ = physical block size.
$\tau_i$ = fraction of time data set is on level i.
$d$ = number of times the data set is opened.
$\lambda$ = the proportion of time to write the data set back to its
original position. For read-only data sets, $\lambda = 0$; for
full write back at read speed $\lambda = 1$.

Storage Device Characteristics:

$t_r^i$ = random access time for level i.
$t_q^i$ = sequential access time for level i.
$t_s^i$ = transmission rate to or from level i.
$t_l^i$ = average rotational latency time for level i.
$t_c^i$ = minimum access arm movement time for level i.
$n_i$ = unit cost of storage space at level i for the given time
period.
$b_i$ = transfer size per access when data set is being moved from
a lower level i to another level (or from a higher level to
level i).
$B_i$ = largest size that can be transferred without additional access
cost.

CPU and Channel Characteristics:

$m$ = adjusted cost per unit time for computer system excluding
channel - an estimate of computer wait time induced by I/O
$M$ = unadjusted computer system cost per unit time
$u$ = cost of channel per unit time
$\beta$ = number of buffers
$w$ = computer setup time for opening a data set

Table 1
Parameters in the LSWL Model
(adapted from [Lum et al., 1975])

Let us assume that the data set is at level i of the hierarchy
when inactive and at level j when active. (For consistency we will
adopt the notation used by Lum et al. whereby the first subscript will
be the inactive device, and the second the active one. Also the higher
levels (i.e., those with faster access) of the hierarchy will have
higher indices.) The objective function can be considered to have three
major terms:

$$ f_{ij} = \{\text{storage cost}\} + \{\text{local process access costs}\} + \{\text{staging transfer costs}\} $$

The first term is the cost of storing the data on the active
and inactive devices.

$$ \{\text{storage cost}\} = \tau_i n_i S' + \tau_j n_j S $$

When a data set is moved from level i to level j it is not necessarily
deleted from level i; therefore it should be noted that $\tau_i + \tau_j \ge 1$.

(Note: In the LSWL model $S$ always equals $S'$, but to investigate
the properties of partial staging and for reasons of clarity we have
made this modification.)

The  second  term  is  the  cost  for  the  user  or  process  to  access 
the  data  from  the  active  device.   This  term  takes  into  account  the  CPU 
costs  and  transfer  overhead  as  well  as  channel  costs  for  both  random  and 
sequential  accesses.   The  components  of  the  access  cost  term  are: 


$$ \{\text{CPU costs for sequential access}\} = m q \left[ (t_q^j / \beta) + (s / t_s^j) \right] $$

$$ \{\text{CPU costs for random access}\} = m r \left[ t_r^j + (s / t_s^j) \right] $$

$$ \{\text{channel costs for sequential access}\} = u q \left[ (t_l^j / \beta) + (s / t_s^j) \right] $$

$$ \{\text{channel costs for random access}\} = u r \left[ t_l^j + (s / t_s^j) \right] $$


The  components  of  this  term  identified  as  "CPU  costs"  are  measures  of 
the  cost  of  delays  incurred  by  the  random  and  sequential  accesses  and 
not  of  actual  resources  consumed  by  the  process  or  in  its  behalf.   For 
a  more  lengthy  discussion  of  these  costs  and  the  quantity  m,  see  below. 
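
As a rough illustration, the four access-cost components listed above, as we read them, can be summed in a short function. The argument names simply mirror the symbols of table 1, and the numbers in the usage example are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def access_cost(q, r, s, beta, m, u, t_q, t_r, t_l, t_s):
    """Sum of CPU and channel costs for access at the active level j.

    q, r          -- sequential and random block accesses
    s             -- physical block size
    beta          -- number of buffers
    m, u          -- adjusted system cost and channel cost per unit time
    t_q, t_r, t_l -- sequential access, random access and latency times
    t_s           -- transmission rate of the active device
    """
    transmit = s / t_s  # time to move one block
    cpu_seq = m * q * (t_q / beta + transmit)
    cpu_rand = m * r * (t_r + transmit)
    chan_seq = u * q * (t_l / beta + transmit)
    chan_rand = u * r * (t_l + transmit)
    return cpu_seq + cpu_rand + chan_seq + chan_rand


# Hypothetical values, not taken from the report.
print(access_cost(q=2, r=1, s=10, beta=2, m=1, u=2,
                  t_q=4, t_r=3, t_l=2, t_s=5))
```

Note that every component is strictly proportional to q or r, which is the linearity in load discussed elsewhere in this report.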
The final term (staging transfer costs) computes the cost of
moving the data from level i to level j and includes factors for writing
the data back to level i if necessary, preparation for transfer, latency
waiting for the next block, and block transmission costs.

$$ \{\text{cost to move data between level } i \text{ and level } j\} = (1 + \lambda)\, d \left\{ M w + (S / b_i) \left[ m t_l^i + (m b_i / t_s^i) + (u b_i / t_s^i) \right] + (m S / B_i)\, t_c^i \right\} \Gamma(i - j), $$

where $\Gamma(x)$ is 0 if $x = 0$ and is 1 otherwise.

Notice that this model says that if, say, only 10 percent of the data is
shipped back ($\lambda = 0.1$), then only 10 percent of the setup cost $Mw$ is
incurred by this operation. Clearly this is incorrect; the cost of
setup is independent of the amount of data subsequently transferred. We
have therefore corrected the setup term in our model to read $(1 + \Gamma(\lambda)) M w d$.
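
The effect of this correction can be seen in a small sketch that evaluates the staging term both ways. The function below follows the formula above as we read it; the `corrected` flag and all numeric values in the usage example are our own illustrative assumptions.

```python
def gamma(x):
    """Indicator used in the model: 0 if x == 0, otherwise 1."""
    return 0 if x == 0 else 1


def staging_cost(i, j, lam, d, S, M, m, u, w, b_i, B_i, t_l, t_s, t_c,
                 corrected=True):
    """Cost of staging between inactive level i and active level j.

    With corrected=False the setup cost M*w is scaled by (1 + lam), as in
    the original formula; with corrected=True it is scaled by
    (1 + gamma(lam)), so any write-back incurs one full extra setup.
    """
    if gamma(i - j) == 0:
        return 0.0  # data set stays on one device, nothing is moved
    transfer = ((S / b_i) * (m * t_l + m * b_i / t_s + u * b_i / t_s)
                + (m * S / B_i) * t_c)
    setups = (1 + gamma(lam)) if corrected else (1 + lam)
    return d * (setups * M * w + (1 + lam) * transfer)


# Hypothetical parameter values, for illustration only.
params = dict(i=1, j=3, lam=0.1, d=1, S=100, M=10, m=1, u=1, w=2,
              b_i=10, B_i=50, t_l=0.5, t_s=10, t_c=1)
print(staging_cost(corrected=False, **params))
print(staging_cost(corrected=True, **params))
```

The two versions agree when nothing is written back and when everything is written back; they differ only for partial write-back, where the original formula undercharges the setup.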

At this point it is appropriate to discuss the parameter m
in some detail. When a process or user accesses a data set, it must
wait for this access to complete. This delay consists primarily of the
time required to set up the device (rotational latency or arm movement)
and the time to transfer the data. Clearly, multiprogramming systems
take advantage of this wait time by allowing other processes to utilize
the processor. However, these delays, which are incurred by all running
processes in the system, contribute to the total amount of CPU idle time.
To account for this lost time Lum et al. define an "adjusted machine
cost", m. For lack of a better formulation, they have defined this cost
to be the percentage of CPU idle time times the dollar cost associated with
the CPU. There are some difficulties with such a definition. For example,
as the load on the system increases, so may CPU utilization, queueing
delays and system overhead, thus increasing cost. The objective function
does not account for this phenomenon. This characterization also assumes
that the CPU is the crucial resource to be utilized. Current trends in
hardware could actually make this assumption false. It may also be
false for certain specific applications. It might be equally valid to
include idle channel time incurred by a process because it was using the
processor. We intend to investigate this issue in more detail in the
future.

Network Model

The  model  discussed  here  will  require  further  extensions  to 
model the cost of a distributed data management system in complete
detail. However, it is a reasonable first approximation and will allow
investigation  of  the  tradeoffs  between  storage  and  access  economy,  as 
well  as  provide  an  accurate  model  of  file  or  data  set  staging  in  a 
network  environment. 

As  mentioned  earlier,  a  primary  concern  in  extending  the  LSWL 
model  to  allow  for  a  network  in  the  hierarchy  is  to  account  for  the  flow 
control  and  other  protocol-related  costs  that  will  be  incurred.   The 
cost  function  used  has  the  basic  form: 

$$ c_{ij} = \begin{cases} f_{ij}, & i > k \\ g_{ij}, & i < k \end{cases} \qquad (j \text{ always greater than } k) $$

where k is the first remote level of the hierarchy. (Here we are
tacitly assuming that all staging will be done to a local device.) We
have already discussed the original objective function, $f_{ij}$. We will
now proceed to consider the cost function that deals with the network.
The reader is directed to table 2 for a key to additional symbols and to
the summary of $g_{ij}$ in figure 2. The network cost function can be
characterized as:

$$ g_{ij} = \{\text{storage cost}\} + \{\text{cost to move between inactive remote level and highest remote level}\} + \{\text{cost to move between highest remote level and net}\} + \{\text{network costs}\} + \{\text{cost to move between net and active level}\} + \{\text{process access costs}\} $$

$e$ = number of message exchanges necessary to set up the transfer
$t_{nd}$ = message round trip delay time in the network
$t_{np}$ = CPU time for protocol overhead (on a per protocol message basis)
$K$ = compression factor
$t_{nr}$ = network CPU time to receive data
$t_{nt}$ = network CPU time to transmit data
$u_r$ = remote channel cost
$u_L$ = local channel cost
$m_r$ = adjusted remote system cost
$m_L$ = adjusted local system cost
$n_k$ = network transmission cost
$M_r$ = unadjusted remote system cost
$M_L$ = unadjusted local system cost
$b_k$ = network packet size

Table 2
Supplementary Parameter List for Network Model

$$
\begin{aligned}
g_{ij} ={} & \tau_i n_i S' + \tau_j n_j S && \text{(1) storage cost} \\
& + (1 + \lambda)\, d \left\{ (S K / b_k) \left[ m_r b_k / t_s^k + u_r b_k / t_s^k \right] \right\} && \text{(2) cost to move between highest remote level and net} \\
& + \left\{ (1 + \Gamma(\lambda))\, d M_r w \right\} + (1 + \lambda)\, d \left\{ (S / b_i) \left[ m_r t_l^i + (m_r b_i / t_s^i) + (u_r b_i / t_s^i) \right] + (m_r S / B_i)\, t_c^i \right\} && \text{(3) cost to move between inactive level } i \text{ and highest remote level} \\
& + d e \left\{ (m_r + m_L)\, t_{nd} + (M_r + M_L)\, t_{np} \right\} \left\{ 1 + \Gamma(\lambda) \right\} && \text{(4) protocol setup cost} \\
& + 2 e n_k d \left\{ 1 + \Gamma(\lambda) \right\} && \text{(5) network charges for protocol messages} \\
& + (1 + \lambda) (S K n_k / b_k)\, d && \text{(6) data transfer network costs} \\
& + (M_r t_{nt} + M_L t_{nr}) (S / b_k)\, d + \lambda (M_r t_{nr} + M_L t_{nt}) (S / b_k)\, d && \text{(7) network software cost to send data and receive it} \\
& + m_L q \left[ (t_q^j / \beta) + (s / t_s^j) \right] + m_L r \left[ t_r^j + s / t_s^j \right] && \text{(8) CPU costs for random and sequential access and for retrieval from active location} \\
& + u_L q \left[ (t_l^j / \beta) + (s / t_s^j) \right] + u_L r \left[ t_l^j + (s / t_s^j) \right] && \text{(9) channel costs for local retrieval} \\
& + (1 + \lambda)\, d \left\{ (S K / b_k) \left[ (m_L b_k / t_s^k) + (u_L b_k / t_s^k) \right] \right\} && \text{(10) cost to move between net buffers and active device}
\end{aligned}
$$

Figure 2
Objective Function for the Network Model

The  major  differences  in  this  equation  from  the  purely  local 
version  are  the  added  network  costs  and  the  distinction  between  local 
and  remote  charging  rates.   Otherwise  most  of  the  terms  are  special 
cases  of  the  original  and  we  will  not  discuss  them  in  detail.   For  a 
summary  of  the  staging  process  and  the  various  costs,  the  reader  should 
consult  figure  3,  which  shows  schematically  where  the  various  terms 
(labeled  as  in  figure  2)  enter  into  the  data  transfer  process. 

The  network  costs  consist  of  two  major  components:   the  setup 
costs  for  using  the  network  and  the  cost  of  the  traffic  sent  on  the 
network. 

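$$ \{\text{network costs}\} = d e \left\{ (m_r + m_L)\, t_{nd} + (M_r + M_L)\, t_{np} \right\} \left\{ 1 + \Gamma(\lambda) \right\} + 2 e n_k d \left\{ 1 + \Gamma(\lambda) \right\} + (1 + \lambda) (S K n_k / b_k)\, d $$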

The  first  term  (term  (4)  in  figure  2)  is  the  cost  of  setting 
up  the  transfers  in  terms  of  the  number  of  message  exchanges  required 
(protocol  negotiation),  network  delay  and  protocol  processing.   The 
other  two  terms  are  network  charges  for  the  packets  actually  sent.   The 
first  of  these  (term  (5)  in  figure  2)  is  the  cost  for  the  protocol 
negotiation  and  connection  setups,  and  the  second  (term  (6))  is  the  cost 
of  data  actually  sent.   The  constant  K  in  this  last  term  is  a  "compression" 
factor  to  allow  inclusion  of  data  compression  and  protocol  overhead  in 
data transmission (headers, restart markers, etc.). The transmission
cost of the network, $n_k$, is calculated in terms of packets sent, a
charging structure in use in the commercial world. (It should be noted
that the symbols with the subscript k do not refer to the properties of
the highest remote level of the hierarchy but to properties of the
network, such as transmission rate, packet size, etc.) Factors involving
$\lambda$ are included in the network costs to take account of the possibility
of shipping the data back to inactive store. Notice that a transfer
must be set up no matter how small an amount is sent back - hence the
appearance of $\Gamma(\lambda)$ in the formula. Terms (2), (7), and (10) (see figure
2), which are costs of data transfer to and from the network, will also
be considered as part of "network costs" in our later analysis, since
they form important components of the additional cost of storing at a
remote site. But in form they are similar to the local transfer costs
of Lum's model and so do not need further discussion here.

[Figure 3: schematic of the data transfer process, showing where the terms of figure 2 enter]
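
A minimal sketch of the three network cost terms discussed above (terms (4), (5) and (6) of figure 2, as we read them) may help in seeing how the setup and traffic components scale. The argument names mirror the symbols of table 2, and the values in the usage example are hypothetical.

```python
def gamma(x):
    """Indicator used in the model: 0 if x == 0, otherwise 1."""
    return 0 if x == 0 else 1


def network_cost(d, e, lam, S, K, n_k, b_k, m_r, m_L, M_r, M_L, t_nd, t_np):
    """Setup cost plus packet charges for staging a data set over the net."""
    # One transfer to fetch the data, plus one more if anything is sent back.
    transfers = 1 + gamma(lam)
    # Term (4): delay and protocol processing for e message exchanges.
    setup = d * e * ((m_r + m_L) * t_nd + (M_r + M_L) * t_np) * transfers
    # Term (5): network charges for the protocol messages themselves.
    protocol_packets = 2 * e * n_k * d * transfers
    # Term (6): network charges for the data packets.
    data_packets = (1 + lam) * (S * K * n_k / b_k) * d
    return setup + protocol_packets + data_packets


# Hypothetical values, not taken from the report.
common = dict(d=1, e=2, S=1000, K=1, n_k=0.5, b_k=100,
              m_r=1, m_L=1, M_r=2, M_L=2, t_nd=0.5, t_np=0.25)
print(network_cost(lam=0, **common))  # read-only data set
print(network_cost(lam=1, **common))  # full write-back
```

Only the last term grows with the amount of data moved; the first two depend on how often the data set is opened and on whether any data is written back.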

Example.  Consider  a  situation  in  which  there  is  a  four-level 
hierarchy  (core,  drum,  disk,  and  archive),  both  locally  and  at  a  remote 
site.  Assume  that  values  of  the  relevant  parameters  are  as  given  in 
table  3  (taken  from  Lum  et  al.  [1975])  and  that  they  are  the  same  at 
both  sites.  It  does  not,  of  course,  make  sense  to  consider  inactive 
storage  at  remote  core,  and  this  case  is  omitted.  Let  the  number  of 
local buffers be two (β = 2) and assume that there is no setup time to
open a data set (w = 0). Suppose that a data set of 10^8 bytes is active
for one eight-hour shift per day, so that on a per-month basis d = 30
(i.e., the data set is opened once per day). Furthermore, the set is
then active 1/3 of the time (x_j = 1/3), and we shall assume that x_i = 1
(i.e., that the set is permanently resident at the inactive location).
Let the set be blocked into 1500-byte physical records (s = 1500) and
suppose that λ = 1 (so that the data set is always written back at the
end of each day). Finally assume that there are 90,000 sequential
accesses to the active copy per month and 210,000 random accesses (i.e.,
q = 90,000 and r = 210,000). These values all correspond to those used
by  Lum  et  al.  in  their  example.   Notice  that  the  total  number  of  accesses 
(300,000  per  month)  is  very  low,  amounting  to  less  than  one  I/O  per 
second.   The  reader  should  keep  in  mind  this  hidden  assumption. 
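
The "less than one I/O per second" figure can be checked directly. The following is a minimal sketch, assuming a 30-day month (consistent with d = 30); the variable names are ours and do not appear in the report.

```python
# Access-rate check for the basic example (assumes a 30-day month).
SEQUENTIAL_PER_MONTH = 90_000   # q in the text
RANDOM_PER_MONTH = 210_000      # r in the text
ACTIVE_HOURS_PER_DAY = 8        # one eight-hour shift
DAYS_PER_MONTH = 30             # assumption, matches d = 30

total_accesses = SEQUENTIAL_PER_MONTH + RANDOM_PER_MONTH
active_seconds = ACTIVE_HOURS_PER_DAY * 3600 * DAYS_PER_MONTH
month_seconds = 24 * 3600 * DAYS_PER_MONTH

# Rate while the data set is actually open, and averaged over the month.
rate_while_active = total_accesses / active_seconds
rate_over_month = total_accesses / month_seconds

print(f"{rate_while_active:.3f} accesses/sec during the active shift")
print(f"{rate_over_month:.3f} accesses/sec averaged over the month")
```

Either way of counting gives a rate well below one access per second, which is the hidden low-load assumption noted above.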


Parameter   Core        Drum        Disk         Archive      Unit

t_r^i       10^-6       5 x 10^-3   60 x 10^-3   5            second
t_s^i       ∞           10^6        3 x 10^5     5 x 10^4     byte/sec
t_q^i       0           8 x 10^-3   13 x 10^-3   25 x 10^-3   second
t_?^i       0           8 x 10^-3   12 x 10^-3   20 x 10^-3   second
t_c^i       0           0           25 x 10^-3   40 x 10^-3   second
n_i         2 x 10^-2   5 x 10^-4   3 x 10^-5    3 x 10^-7    $/byte/month
b_i         *           20,000      7,000        2,000        byte
B_i         *           4 x 10^6    140,000      10,000       byte

*  Irrelevant
?  Subscript not legible in the source

Table 3

Parameters for Storage Hierarchy


Next, network parameters are needed. We have taken b_k = 125
bytes, the ARPANET packet size; t_{nd} = 200 ms and t_s^k = 5 x 10^3
bytes/sec, both ARPANET figures; t_{np} = 1 ms, which is roughly the time
for an ARPA NCP to handle one protocol command (including response);
t_{nr} = 1 ms, an average figure which runs from about 0.5 ms NCP time to
2 ms if the process must be awakened; and t_{nt} = 2 ms, which consists of
about 1 ms to get to the NCP and 0.5 to 1 ms to use it. (These estimates
for t_{np}, t_{nr}, and t_{nt} were supplied to us by G. Grossman of the
Center for Advanced Computation.) It should be noted that both t_{nr} and
t_{nt} should be slightly larger to allow for data processing by the file
transfer protocol. This is particularly true if data compression is being
carried out. But for this example we initially assume K = 1. Also, t_{nr}
and t_{nt} as given are times per message; we have divided by 8 to get a
per-packet estimate, since a maximum of 8 packets per message is allowed.
The parameter e was set at 15. This is arrived at as follows. In the
ARPANET,  it  requires  7  exchanges  to  open  an  FTP  connection,  plus  from  4 
to  7  commands  to  set  parameters  and  3  more  to  open  the  data  connection. 
It  should  be  noted  that  by  using  ARPANET  data  and  the  values  supplied  by 
Grossman  we  are  essentially  computing  lower  bounds  on  network  costs.   In 
other  environments  the  network  costs  will  be  higher  and  results  are 
likely  to  be  quite  different. 
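
The per-packet times and the choice e = 15 follow from simple arithmetic. The sketch below assumes that the two per-message times divided by 8 are the 1 ms and 2 ms figures quoted above (the subscripts are hard to read in the source); the variable names are ours.

```python
# Per-packet host software times and FTP exchange count (values from the text).
PACKETS_PER_MESSAGE = 8          # ARPANET maximum quoted in the text

t_nr_message_ms = 1.0            # assumed reading: the 1 ms per-message figure
t_nt_message_ms = 2.0            # assumed reading: the 2 ms per-message figure

t_nr_packet_ms = t_nr_message_ms / PACKETS_PER_MESSAGE
t_nt_packet_ms = t_nt_message_ms / PACKETS_PER_MESSAGE

# Exchanges needed before data flows: open FTP connection, set parameters,
# open the data connection.
open_ftp = 7
set_parameters = range(4, 8)     # "from 4 to 7 commands"
open_data = 3
exchange_counts = [open_ftp + p + open_data for p in set_parameters]

print(t_nr_packet_ms, t_nt_packet_ms)   # per-packet times in ms
print(exchange_counts)                  # possible values of e
```

The count ranges from 14 to 17 exchanges, so e = 15 sits near the low end of that range.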

Finally, cost estimates are needed. For network transmission
we assumed n_k = $1.25 per 1000 packets, a quoted Telenet commercial
rate. To begin with we have assumed that m_L = m_r = $10/hr., M_L = M_r =
$100/hr., and u_L = u_r = $8/hr. Clearly under these assumptions remote
storage will not be cost effective; but by adjusting the cost of the
remote site relative to that locally, we should reach a point where
remote storage is cheaper. The values calculated for costs c_{ij} (see
figures 1 and 2) are given in table 4. As expected, remote storage is
far from being economical for the assumed cost structure. The cheapest
method is for the inactive data to be stored on local archive and
transferred to local disk when active.

                              Active Location (j)

Inactive            Local     Local     Local     Local
Location (i)        Core      Drum      Disk      Archive

Local Core          2000
Local Drum           717       50.0
Local Disk           670       19.8      3.05
Local Archive        668       17.5      1.91      3.01
Remote Drum          789      139.0    123.0     125.0
Remote Disk          742       92.3     76.7      78.6
Remote Archive       740       90.0     74.4      76.4

Table 4
Computed values of total costs c_{ij} for the basic example.
Entries are in thousands of dollars per month.
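
The cheapest strategy can be read off Table 4 mechanically. The sketch below simply transcribes the table entries into a dictionary (the data structure and names are ours) and compares the best local and best remote choices.

```python
# Table 4 entries (thousands of dollars per month), keyed by
# (inactive location i, active location j). Blank cells are omitted.
TABLE_4 = {
    ("Local Core", "Local Core"): 2000.0,
    ("Local Drum", "Local Core"): 717.0,
    ("Local Drum", "Local Drum"): 50.0,
    ("Local Disk", "Local Core"): 670.0,
    ("Local Disk", "Local Drum"): 19.8,
    ("Local Disk", "Local Disk"): 3.05,
    ("Local Archive", "Local Core"): 668.0,
    ("Local Archive", "Local Drum"): 17.5,
    ("Local Archive", "Local Disk"): 1.91,
    ("Local Archive", "Local Archive"): 3.01,
    ("Remote Drum", "Local Core"): 789.0,
    ("Remote Drum", "Local Drum"): 139.0,
    ("Remote Drum", "Local Disk"): 123.0,
    ("Remote Drum", "Local Archive"): 125.0,
    ("Remote Disk", "Local Core"): 742.0,
    ("Remote Disk", "Local Drum"): 92.3,
    ("Remote Disk", "Local Disk"): 76.7,
    ("Remote Disk", "Local Archive"): 78.6,
    ("Remote Archive", "Local Core"): 740.0,
    ("Remote Archive", "Local Drum"): 90.0,
    ("Remote Archive", "Local Disk"): 74.4,
    ("Remote Archive", "Local Archive"): 76.4,
}

# Overall cheapest strategy, and cheapest strategy with remote inactive store.
cheapest = min(TABLE_4, key=TABLE_4.get)
cheapest_remote = min(
    (key for key in TABLE_4 if key[0].startswith("Remote")), key=TABLE_4.get
)
ratio = TABLE_4[cheapest_remote] / TABLE_4[cheapest]

print(cheapest, TABLE_4[cheapest])
print(cheapest_remote, TABLE_4[cheapest_remote])
print(f"best remote strategy costs about {ratio:.0f} times the best local one")
```

Under the equal-cost assumptions of the basic example, the best remote strategy is more than an order of magnitude dearer than the best local one.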


Analysis  of  the  Cost  Formula 

In  this  section  we  attempt  an  assessment  of  the  effects  of  the 
various terms in the formula for g_{ij}. In particular, we look at the
formula  from  the  point  of  view  of  determining  what  range  of  parameter 
values  or  cost  differentials  will  make  remote  storage  cost  effective. 

Comparing  figures  1  and  2,  notice  that  terms  (8),  (9),  and  the 
second  part  of  term  (1)  (the  cost  of  storage  on  the  local  staging  device) 
appear in both f_{ij} and g_{ij}. They involve only local costs and belong to
what  might  be  called  the  post-staging  phase  of  the  access  process. 
These  terms  therefore  play  no  role  in  a  comparison  of  the  absolute 
costs  of  local  and  remote  storage.   They  do,  however,  play  a  role  in  the 
study  of  relative  costs,  since,  if  the  staged  data  is  used  very  heavily, 
terms  (8)  and  (9)  may  form  a  large  part  of  the  total.   The  same  holds 
for term (1), if a large amount of data is staged and the staging storage
device is costly, as it usually is. In this section, however, we shall
consider g_{ij} - f_{ij} and so shall ignore terms (8) and (9), as well as
the second part of term (1). We shall also set λ to zero, since a non-zero
value can at most introduce a factor of two into transfer costs. (It is
also reasonable to anticipate building some more rational update mechanism
into the model than a simple shipping of a large fraction of the data
(presumably modified) back to the original site.) We also set x_i = 1,
corresponding to permanent storage on the inactive device.

Term (3) in g_{ij} - a transfer cost between levels - also has
its counterpart in f_{ij}, namely, the last term. Term (3) may be written
more simply as

      d[M_r w_r + S(m_r φ_i + u_r Ψ_i)],

where φ_i and Ψ_i are functions involving properties of the inactive
device:

      φ_i = t_q^i/b_i + 1/t_s^i + t_c^i/B_i

      Ψ_i = 1/t_s^i
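
The two device functions can be evaluated from the Table 3 parameters. The sketch below assumes the reading φ_i = t_q^i/b_i + 1/t_s^i + t_c^i/B_i and Ψ_i = 1/t_s^i (the formula is badly garbled in the scan; this reading is chosen because it reproduces the archival values of about 10^-8 and 5 x 10^-9 hours per byte quoted later in the analysis). Function and dictionary names are ours.

```python
SECONDS_PER_HOUR = 3600.0

# Table 3 parameters for the devices that can serve as inactive store.
# t_q, t_c in seconds; t_s in bytes/sec; b, B in bytes.
DEVICES = {
    "drum":    {"t_q": 8e-3,  "t_s": 1e6, "t_c": 0.0,   "b": 20_000, "B": 4e6},
    "disk":    {"t_q": 13e-3, "t_s": 3e5, "t_c": 25e-3, "b": 7_000,  "B": 140_000},
    "archive": {"t_q": 25e-3, "t_s": 5e4, "t_c": 40e-3, "b": 2_000,  "B": 10_000},
}

def phi(dev):
    """phi_i in hours per byte (assumed form of the garbled formula)."""
    seconds = dev["t_q"] / dev["b"] + 1.0 / dev["t_s"] + dev["t_c"] / dev["B"]
    return seconds / SECONDS_PER_HOUR

def psi(dev):
    """Psi_i in hours per byte: the reciprocal of the transfer rate."""
    return (1.0 / dev["t_s"]) / SECONDS_PER_HOUR

for name, dev in DEVICES.items():
    print(f"{name:8s} phi = {phi(dev):.3e}  psi = {psi(dev):.3e}  (hr/byte)")
```

With these inputs the archive gives the largest values of both functions, and drum and disk come out several times smaller.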

The last term in f_{ij} is quite similar, reading

      d[M_L w_L + S(m_L φ_i + u_L Ψ_i)].

Here we have omitted the T(i - j) factor for comparison purposes; this
omission is justifiable since it is rarely cost effective to make the
staging storage the same as the inactive store. If we also assume that
the cheapest device for inactive store (either local or remote) is the
same at both sites (so that φ_i and Ψ_i are the same in both f_{ij} and
g_{ij}), then we obtain the following expression for differential cost:

(A)   g_{ij} - f_{ij} = d[(M_r w_r - M_L w_L) + S φ_i (m_r - m_L)
                           + S Ψ_i (u_r - u_L)]
                        + S'(n_{ir} - n_{iL})
                        + {Terms (2) + (4) + (5) + (6) + (7) + (10)}.

Here we have used n_{ir} and n_{iL} to distinguish between the remote and
local costs for inactive storage, and w_r, w_L to indicate remote and
local file setup times, respectively.
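
The part of expression (A) that does not involve the network terms is easy to evaluate. The sketch below is illustrative only: the class and function names are ours, the network terms (2), (4), (5), (6), (7), and (10) are left out because their full forms appear only in figure 2, and the archival values of φ_i and Ψ_i are the rounded ones quoted later in this section.

```python
from dataclasses import dataclass

@dataclass
class SiteCosts:
    M: float    # $/hr, rate multiplying the setup time w in term (3)
    m: float    # $/hr, rate multiplying phi_i
    u: float    # $/hr, rate multiplying Psi_i
    w: float    # file setup time, hours
    n_i: float  # inactive storage cost, $/byte/month

def differential_without_network(remote, local, d, S, S_prime, phi_i, psi_i):
    """First two lines of (A): g_ij - f_ij minus the network terms."""
    transfer = d * (
        (remote.M * remote.w - local.M * local.w)
        + S * phi_i * (remote.m - local.m)
        + S * psi_i * (remote.u - local.u)
    )
    storage = S_prime * (remote.n_i - local.n_i)
    return transfer + storage

local = SiteCosts(M=100.0, m=10.0, u=8.0, w=0.0, n_i=3e-7)
same_remote = SiteCosts(M=100.0, m=10.0, u=8.0, w=0.0, n_i=3e-7)
cheap_remote = SiteCosts(M=100.0, m=10.0, u=8.0, w=0.0, n_i=1.5e-7)

args = dict(d=30, S=2e4, S_prime=1e8, phi_i=1e-8, psi_i=5e-9)
print(differential_without_network(same_remote, local, **args))
print(differential_without_network(cheap_remote, local, **args))
```

With identical sites the non-network part vanishes, so the whole differential is the network cost; halving the remote archive price yields a saving of about $15 per month, which the network terms must not exceed if remote storage is to pay.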

We wish to investigate under what conditions g_{ij} - f_{ij} is
approximately zero. Consider first what happens when we neglect the
network protocol costs (terms (4) and (5)), and also the setup time
(i.e., we set w_r = w_L = 0, as does Lum). (The conditions under which
terms (4) and (5) are relatively small are discussed below. Setting
w = 0 is invalid for many systems; the consequences of a non-zero w will
also be discussed further below.) The expression for g_{ij} - f_{ij} now
looks like:

(B)   g_{ij} - f_{ij} ≈ d S [(m_r - m_L) φ_i + (u_r - u_L) Ψ_i]
                        + S' [n_{ir} - n_{iL}]
                        + {Terms (2) + (6) + (7) + (10)}.
It  is  important  to  notice  that  the  four  bracketed  terms  contain  a  common 
factor  of  Sd.   Hence  we  can  make  the  following  immediate  remarks  about 
the approximate expression (B).

1.  If S = S', or if n_{ir} = n_{iL}, the parameter S (the amount of
data transferred) has no effect on which storage (local or
remote) is cheaper. The cost differential is, of course,
proportional to S; however, relative costs are independent of
S.

2.  If remote and local storage costs are equal (n_{ir} = n_{iL}),
the expression given in (B) has a common factor d (the number
of times the data transfer takes place). Thus the role played
by d in the cost comparison is similar to that played by S, as
discussed in the preceding remark. Equality of storage costs
is probably a very realistic approximation. Since the cheapest
inactive storage devices at the two sites are likely to be
identical (or very similar), there is a valid basis for
assuming a negligible cost differential.
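
The factorization behind these two remarks can be written out explicitly. In the sketch below, N is a symbol introduced here (not in the report) for the sum of network Terms (2), (6), (7), and (10) after their common factor Sd has been divided out.

```latex
% N: hypothetical shorthand for [Terms (2)+(6)+(7)+(10)] / (S d)
\begin{align*}
g_{ij} - f_{ij}
  &\approx dS\,\bigl[(m_r - m_L)\phi_i + (u_r - u_L)\Psi_i + N\bigr]
     + S'\,(n_{ir} - n_{iL}) \\
  &= S\,\bigl\{ d\,\bigl[(m_r - m_L)\phi_i + (u_r - u_L)\Psi_i + N\bigr]
     + (n_{ir} - n_{iL}) \bigr\}
     && \text{if } S = S', \\
  &= dS\,\bigl[(m_r - m_L)\phi_i + (u_r - u_L)\Psi_i + N\bigr]
     && \text{if } n_{ir} = n_{iL}.
\end{align*}
```

In the first case the sign of the differential does not depend on S; in the second it depends on neither S nor d.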

The  two  preceding  remarks  merely  serve  to  indicate  some  factors 
which  do  not  help  to  make  remote  storage  cost  effective.   There  are  only 
two  features  of  our  model  which  can  help  to  make  remote  storage  cost 
effective.   These  are 

a)  a  lower  cost  for  term  (3)  than  occurs  for  the  comparable  term 
in f_{ij}, and

b)  a  lower  cost  for  remote  inactive  storage  than  for  local 
inactive  storage. 

To  get  some  idea  of  how  great  the  savings  must  be,  we  note  that  even  if 
local  costs  are  large  and  remote  costs  are  zero,  the  network  costs 
(including  cost  of  transfer  to  and  from  the  net)  may  be  large  enough  so 
that  remote  storage  is  not  economical.   Specifically,  this  will  occur 
when  (from  (A)) 

(C)   Terms (2) + (4) + (5) + (6) + (7) + (10)

         > d[M_L w_L + S φ_i m_L + S Ψ_i u_L] + S' n_{iL},

where m_r = M_r = 0 in the network terms.

In  view  of  the  preceding  comment,  it  is  worthwhile  to  tabulate 
estimates  of  the  magnitudes  of  the  network  terms  for  closer  analysis. 
Table  5  contains  a  listing  of  the  network  terms  in  a  format  convenient 
for  comparison  and  estimation.   In  each  term,  factors  independent  of  the 
storage  and  transfer  strategies  or  of  host  charging  policies  have  been 
lumped  into  a  single  parameter  and  a  careful  estimate  of  this  parameter 
has  been  made.   In  cases  where  the  parameter  may  vary  widely,  bounds  are 
given. If the variation is not likely to be as much as an order of
magnitude, an average value is given. A number of further remarks may
be derived immediately from inspection of table 5. These are:

3.  Term  (6)  (data  transfer  cost)  is  generally  the  largest  of  the 
network  terms  by  about  an  order  of  magnitude. 

4.  Terms  (2)  and  (10)  (cost  of  transfers  between  net  and  host) 
become  comparable  to  Term  (6)  only  when  the  constant  C  is  at 
or  near  its  upper  bound.   This  situation  corresponds  to  very 
small network bandwidth (t_s^k about 500 bytes per second).

5.  Term  (7)  (network  software  cost  of  data  transfer)  is  small 
compared  to  Term  (6)  unless  one  of  the  following  conditions 
holds: 

a)  CPU  time  is  very  expensive, 

b)  network  software  is  more  inefficient  than  assumed,  or 

c)  the  compression  factor  K  is  unrealistically  small. 

6.  The  protocol  costs  (Terms  (4)  and  (5))  are  about  equal  to  each 
other,  although  Term  (5)  dominates  if  CPU  time  is  relatively 
cheap.   Both  of  these  terms  tend  to  be  negligible  compared  to 
Term  (6).   That  is,  they  are  an  order  of  magnitude  smaller 
unless  the  amount  of  data  transferred  (SK)  is  very  small  (less 
than about 5 x 10^[exponent illegible] bytes).

7.  In  summary,  for  most  situations  the  totality  of  the  network 
terms  may  be  approximated  by  Term  (6)  and  hence  estimated  to 
be dSK x 10^-5.*


*  The constant here has dimensions dollars/byte. The reader should be
warned that by consolidating constants in this analysis we have sometimes
generated expressions which may appear dimensionally bizarre. But the
units which must apply to the constants are readily reconstructed.

Returning to comparison (C) above, we see now that, to a good
approximation, remote storage cannot be cheaper than local storage as
long as

(C')   dSK x 10^-5 > d[M_L w_L + S φ_i m_L + S Ψ_i u_L] + S' n_{iL}.

(That is, the remote strategy is then more expensive no matter how cheap
remote storage and processing costs are.) Notice that if, as was assumed
earlier, w_L = 0 and S = S', then inequality (C') simplifies to

(C'')  K x 10^-5 > φ_i m_L + Ψ_i u_L + n_{iL}/d.
The right side of this inequality must be investigated further. Using
the parameter values given in the example of the preceding section, we
find that for archival storage φ_i ≈ 10^-8 and Ψ_i ≈ 5 x 10^-9 (both in
units of hours per byte). For disk or drum these factors are considerably
smaller. Hence the numbers given are rough upper bounds on φ_i and Ψ_i.
We immediately conclude that φ_i m_L and Ψ_i u_L are smaller by orders of
magnitude than the left side of (C'') and hence cannot contribute to
making remote storage cost effective. Furthermore, for archival storage
we assumed n_i = 3 x 10^-7; hence the term n_{iL}/d is negligible also. We
therefore can add to our list of remarks:

8.  If w is small and S = S', network costs far outweigh any
potential savings from the remote site's being cheaper (or
free).

9.  If w is small, but S ≠ S', free remote storage becomes cost
effective when

      S'/S > dK x 10^-5 / n_{iL}

(or when less than about 0.1 percent of the data base is
staged). If remote storage is not free, it still may become
cost effective, specifically when

      S'/S > dK x 10^-5 / (n_{iL} - n_{ir}).

10.  If w_L is large but w_r is negligible (so that M_r w_r is
negligible, as is assumed in (C)), then the large local setup
cost M_L w_L may make remote storage economical (even if
n_{ir} ≈ n_{iL}). To be specific, suppose w_L = 1 sec. (Setup
times of this magnitude occur in some operating systems.) Then
(from (C')) remote storage will be cost effective due to the
setup time differential whenever

      SK < M_L w_L x 10^5 ≈ 30 M_L,

or, for, say, M_L = $200 per hr., SK < 6 x 10^3.
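
The numbers behind Remarks 7 through 10 can be reproduced from the basic-example values. The sketch below is a check under stated assumptions: the 10^-5 dollars/byte constant is taken to be the packet charge divided by the packet size, the archival values of φ_i and Ψ_i are the rounded bounds quoted in the text, and all variable names are ours.

```python
# Basic-example values quoted in the text.
n_k = 1.25 / 1000          # $ per packet (Telenet rate)
b_k = 125                  # bytes per packet
per_byte = n_k / b_k       # $ per byte sent: the 10^-5 constant of Remark 7

K = 1.0                    # compression factor
d = 30                     # transfers per month
m_L, u_L = 10.0, 8.0       # $/hr
phi_i, psi_i = 1e-8, 5e-9  # hr/byte, rounded archival bounds from the text
n_iL = 3e-7                # $/byte/month, local archive

# Inequality (C''): left side (network) versus right side (local savings).
lhs = K * per_byte
rhs = phi_i * m_L + psi_i * u_L + n_iL / d

# Remark 9: free remote storage pays off once S'/S exceeds this ratio.
ratio_free = d * K * per_byte / n_iL

# Remark 10: upper bound on SK when the local setup time is 1 second.
w_L_hours = 1.0 / 3600.0

def sk_bound(M_L):
    """Largest SK (bytes) for which the setup-time saving covers Term (6)."""
    return M_L * w_L_hours / per_byte

print(f"per-byte network charge: {per_byte:.1e} $/byte")
print(f"(C'') left {lhs:.1e} vs right {rhs:.1e}")
print(f"Remark 9 threshold S'/S: {ratio_free:.0f}")
print(f"Remark 10 bound for M_L = $200/hr: {sk_bound(200.0):.0f} bytes")
```

The left side of (C'') exceeds the right by well over an order of magnitude, the Remark 9 threshold of 1000 corresponds to staging 0.1 percent of the data base, and the Remark 10 bound comes out near 5.6 x 10^3 bytes, which the text rounds to 6 x 10^3.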

Computational Results and Conclusions

At  this  point  it  is  probably  a  good  idea  to  remind  the  reader 
of  the  limitations  of  this  model.   The  model  depicts  the  cost  of  a 
program  which  accesses  data  that  reside  at  some  remote  site.   No  attempt 
is  made  to  consider  the  advantages  of  remote  processing,  of  multiple 
copies  for  reliability,  etc. ,  although  some  indirect  implications  along 
these  lines  are  possible.   We  can,  however,  use  the  model  to  investigate 
various  strategies  (such  as  local  caching  of  data)  and  to  evaluate  their 
effectiveness  in  utilizing  remote  resources  under  various  conditions. 
This  section  contains  the  results  of  such  experiments.   The  graphs  of 
this  section  have  all  been  generated  using  the  basic  parameter  values 
listed for the example discussed in detail earlier; that is, all parameter
values not specified in text or figure caption are to be assumed
those  given  in  the  example.   Thus  results  are  to  be  interpreted  as 
holding  in  the  general  context  of  that  basic  example.   As  we  discussed 
in  the  section  just  preceding,  some  system  parameters  are  subject  to 
wide  variation,  and  changing  them  can  have  a  dramatic  effect  on  relative 
sizes of terms, as well as on absolute total costs.

The basic result one finds from manipulating this model is
that  the  network  must  be  heterogeneous  in  order  for  remote  storage  to 
provide  any  cost  advantage.   There  are  several  ways  in  which  this 
heterogeneity  may  be  achieved: 

1.  Excess  capacity.   This  may  be  achieved  either  by  having  a 
network  of  similar  systems  in  which  one  or  more  are  not 
heavily  loaded  or  by  having  different  systems  that  can  take 
advantage  of  their  differences  (speed,  special  hardware,  etc.) 
to  generate  excess  capacity. 

2.  Inexpensive  storage.   This  may  be  achieved  by  either  charging 
policies  or  by  special  facilities  such  as  the  ARPA  Network 
Data  Computer,  laser  stores,  etc. 

3.  Artificially induced heterogeneity. This may be achieved by
politically setting charging rates at some sites so that they
are significantly cheaper than at other sites. This last
method can be fairly dangerous to implement, as can happen
whenever reality is traded for illusion. Experience has shown
that, if charges are sufficiently low (or free), management, as
opposed to users, will tolerate incredibly poor response in
order to use only that resource.

Let us first consider what effect attempts to introduce
heterogeneity into system cost have on overall cost. To introduce
heterogeneity we will set M_r, m_r, and u_r to be some fraction, Z, of
M_L, m_L, and u_L, respectively. This differential can be considered to be
caused by different hardware, different system loads, or different
charging policies on the local and remote systems. For this situation,
as in all others discussed in this section, we are only considering cases
in which a part of the data set is moved (i.e., S ≠ S'). This situation
is intended to correspond to the process's moving only the data it needs.
As one can see by looking at figure 4, Z has little effect on the overall
cost for a data staging model. This result is not surprising if we
consider that, for these values of S and S', about 75% of the cost ($46)
is in storage charges (Term (1)) and network packet charges (Term (6)).
In addition, about $9 is spent on local accessing (Terms (8) and (9)).
Therefore Z may have a larger effect in remote processing environments in
which storage and net charges would not constitute such a large fraction
of the cost (i.e., if relatively more processing time is consumed in the
staging). However, lowering remote storage charges and compressing the
data for shipment over the network can produce a remote strategy which
provides significant savings over the local strategy, as can be seen from
figures 5 and 6. The availability of exotic mass stores (such as the
laser memory, which can provide one or even two orders of magnitude
differential in price) can make a remote strategy a very viable one.
Notice that the results pictured in figures 5 and 6 may be compared with
Remark 9 in the analysis of the preceding section. From that remark, the
crossover point (where remote storage becomes cost effective) can be
estimated to be S = 5 x 10^4 for n_{iL} - n_{ir} = 1.5 x 10^-7 and K = 1.
Figure 5 shows this crossover at S = 1.7 x 10^[exponent illegible]. This
agreement is quite reasonable; much of the discrepancy can be attributed
to the assumption in the analytical study that λ = 0. A similar
comparison holds for the other crossovers shown.
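
The crossover estimate taken from Remark 9 can be reproduced as follows. This is a sketch under the same simplifications as the analytical study (no write-back, negligible setup time, network cost approximated by Term (6) at 10^-5 dollars per byte); the function name is ours and the values of K other than 1 are illustrative only, not read from figure 5.

```python
def crossover_staged_bytes(S_prime, n_iL, n_ir, d, K, per_byte=1e-5):
    """Largest staged amount S for which remote storage still pays (Remark 9).

    Rearranges S'/S > d*K*per_byte / (n_iL - n_ir) for S.
    """
    return S_prime * (n_iL - n_ir) / (d * K * per_byte)

S_PRIME = 1e8      # bytes in the data set
N_IL = 3e-7        # $/byte/month, local archive
D = 30             # transfers per month

# Remote devices at half the local price, as assumed for figure 5.
for K in (1.0, 0.5, 0.25):   # illustrative compression factors
    S = crossover_staged_bytes(S_PRIME, N_IL, N_IL / 2, D, K)
    print(f"K = {K:4.2f}: crossover near S = {S:.1e} bytes")

# Free remote storage: the crossover sits at 0.1 percent of the data set.
free = crossover_staged_bytes(S_PRIME, N_IL, 0.0, D, 1.0)
print(f"free remote storage: S = {free:.1e} bytes")
```

For K = 1 and half-price remote devices the estimate is 5 x 10^4 bytes; stronger compression moves the crossover to proportionally larger amounts of staged data.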

Figure 4

Cost of staging from remote archive to local drum as a function
of Z. S' = 10^8; S = 2 x 10^4. Costs are in dollars per month.

Figure 5

The percent increase in cost of remote over local strategy as a
function of the amount of data moved. Percent increase is computed
as (g_{ij} - f_{ij}) x 100/f_{ij}, where f_{ij} is the best strategy for
local inactive store and g_{ij} is best for remote inactive store.
Remote storage devices are assumed to cost half as much as local
ones; local costs are as given in table 3. The effect of varying
the compression factor K is also shown. For convenience, computed
points are joined by straight lines; the curves are actually smooth.

It is interesting to note that protocol costs and network-related
host software costs make up a fairly small fraction (normally less
than 10%) of the total cost of data staging. (See the section just
preceding for analysis of these costs.) Most interesting are the
facts that network software costs (NCP, etc.) are less than 1% of the
total cost, and that (outside of packet charges) the major proportion
of network-related costs arises from the delays incurred and from
connection setup overhead. Since setup costs are relatively constant, as
other factors become larger due to (for example) moving more data or
more remote processing, the significance of these terms dwindles and more
complex protocol negotiations become viable. Figure 7 gives some
indication of how protocol costs vary as a function of the number of
protocol messages exchanged before the transfer commences. (The amount of
data transferred is held constant at 5 x 10^3 bytes.) A more detailed
analysis of the aspects of network overhead is needed, especially with
regard to its implications for front-ends. An analysis of the overall
impact of network software on the host system would be useful to
determine under what circumstances front-ending is a useful tactic.

We also found that increasing network bandwidth had little
effect on lowering total cost. For example, with Z = .1 and S = 10,000
bytes, increasing network bandwidth from 5 x 10^3 bytes/sec to 5 x 10^5
bytes/sec resulted in just over a 2% decrease in cost. This implies
that for bulk transfers network delay costs are relatively small.
However, this does not imply that increasing bandwidth will not be
cost-effective. Many highly interactive network activities and/or global
traffic levels may require higher bandwidths.

Figure 7

Increase of relative contribution of protocol costs to total cost
of remote strategy as e (the number of message exchanges) increases.
Only a small data set is assumed transferred, and few accesses are
assumed (S = 5000, r = 5000, q = 0).

Local caching of data appears to be a useful method for using
a network in a cost-effective manner. With this method, the local
system maintains a partial copy of the data set. The contents of this
copy are determined by the results of past accesses or in some cases by
some knowledge of what will be needed. When the user requests data the
system first looks to see if the data is local; if so it is fetched from
the  local  storage  medium;  if  not  then  it  must  be  retrieved  from  the 
master  copy  over  the  network.   This  is  a  sort  of  "network  working  set" 
strategy.   Using  the  model  to  investigate  the  properties  of  such  a 
strategy,  we  found  (see  figure  8)  a  rather  steep  rise  in  cost  as  the 
fraction  of  requests  that  must  use  the  network  increased.   Of  course, 
whether  most  requests  can  be  answered  locally  depends  upon  the  size  of 
the  local  store,  the  degree  of  locality  exhibited  by  the  requests  and 
the  replacement  algorithm  used.   However,  if  the  fraction  of  remote 
requests  can  be  kept  low,  significant  savings  can  be  achieved  by  the 
local  caching  of  data.   Further  work  is  needed  to  determine  the  locality 
properties  of  data  base  activity  so  that  one  can  determine  what  the  size 
of  the  local  store  must  be  so  that  a  large  fraction  of  the  requests  may 
be  satisfied  locally. 
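
The lookup rule of this "network working set" strategy can be sketched in a few lines. The report does not specify a replacement algorithm; least-recently-used replacement is assumed here purely for illustration, and the class, the callback, and the toy master copy are all hypothetical names of ours.

```python
from collections import OrderedDict

class LocalCache:
    """Partial local copy of a remote data set (LRU replacement assumed)."""

    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity          # number of records held locally
        self.fetch_remote = fetch_remote  # callable standing in for the network
        self.store = OrderedDict()
        self.requests = 0
        self.remote_requests = 0

    def get(self, key):
        self.requests += 1
        if key in self.store:             # data is local: no network use
            self.store.move_to_end(key)
            return self.store[key]
        self.remote_requests += 1         # must go to the master copy
        value = self.fetch_remote(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used record
        return value

    @property
    def remote_fraction(self):
        """Fraction f of requests that required remote access."""
        return self.remote_requests / self.requests if self.requests else 0.0

# Toy master copy standing in for the remote site.
master = {k: f"record-{k}" for k in range(10)}
cache = LocalCache(capacity=3, fetch_remote=master.__getitem__)
for key in [0, 1, 0, 2, 0, 1, 3, 0]:
    cache.get(key)
print(cache.remote_fraction)
```

The quantity `remote_fraction` plays the role of f in figure 8; how low it can be kept depends on the size of the local store and on the locality of the request stream, as discussed above.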

As  we  have  seen,  the  major  result  of  this  investigation  is 
that  heterogeneity  must  be  introduced  into  a  network  before  remote 
storage  is  advantageous  for  a  user,  and  even  then  a  minimum  amount  of 
data should be moved to the remote site and a maximum amount of computing
should be done once it's been moved. Interestingly enough,
host-related network software overhead does not contribute significantly to
the  total  cost.   It  is  not  clear  what  implications  this  has  for  the 
arguments  for  front-ending  systems;  however,  a  closer  look  at  these 
problems  should  be  undertaken.   An  analysis  of  network  software  from 
the  point  of  view  of  the  host  operating  system  rather  than  from  that  of 
a  single  process  is  needed  to  answer  the  questions  generated  by  these 
findings.

Figure 8

Total cost of responding to a set of requests
vs. f, the fraction of the requests requiring
remote access.

Plans for Further Work

Clearly,  much  more  can  be  learned  by  experimentation  with  the 
present  model.   By  using  parameters  that  describe  specific  systems  and 
their  costs,  we  should  be  able  to  develop  cost  comparisons  for  important 
real  applications.   However,  this  requires  that  accurate  measurements  be 
made of systems to get useful values for the various parameters. In
fact, accurate measurements of the network parameters used in this model
are sorely needed, in addition to the refinement of the cost terms to
allow the investigation of more complex situations.

We  might  also  investigate  other  approaches  to  deciding  on  a 
"best"  storage  policy.   For  example,  since  protocol  implementations 
reside  as  user-level  processes  in  many  operating  systems,  and  since  it 
is  often  useful  to  consider  the  data  set  as  being  staged  in  the  remote 
system,  it  would  be  interesting  to  consider  an  alternative  approach 
which  runs  as  follows:   The  data  set  allocations  on  the  remote  site  are 
determined  according  to  the  LSWL  model,  and  the  lowest-cost  strategy  is 
selected.   The  cost  of  this  strategy  plus  the  relevant  network  costs  are 
then  used  to  form  the  lowest  level  of  the  local  hierarchy,  where  the 
cost  for  the  local  levels  is  computed  using  the  LSWL  model  and  the  last 
level  (the  remote  one)  uses  a  slightly  modified  form.   Further  study  is 
needed  to  determine  whether  this  approach  will  yield  useful  data  for 
decision  making. 

There  are  a  number  of  other  possible  extensions  of  this  study 
which  would  be  worth  pursuing  in  the  future.   A  few  of  these  extensions, 
which  include  both  model  refinements  and  useful  applications,  are  listed 
here.

1.  Considering the various terms as independent modules would
provide  a  much  more  flexible  framework  in  which  various  system 
architectures  and  strategies  could  be  appraised. 

2.  The  effects  of  the  finite  size  of  the  storage  devices  might  be 
included. 

3.  As  mentioned  earlier,  the  definition  of  the  adjusted  system 
cost  m  does  not  appear  to  reflect  the  effects  of  increased 
load  on  the  system.   This  point  requires  more  investigation  to 
gain  a  better  understanding  of  this  parameter  and  of  how,  if 
necessary,  system  loads  may  be  inserted  into  the  model. 

4.  The  model  developed  by  Lum  et  al.  was  intended  to  represent 
file  migration  or  data  staging.   Thus,  when  a  data  set  is 
written back to the inactive device, the operation is considered
to be symmetrical to the original read. If this model
is  to  be  an  accurate  characterization  of  a  data  management 
system,  it  will  be  necessary  to  include  the  cost  of  performing 
updates. 

5.  Since  data  base  reliability  appears  to  be  one  of  the  major 
advantages  of  distributing,  it  is  very  important  that  the 
model  be  capable  of  evaluating  the  cost  of  various  multi-copy 
backup  schemes  with  respect  to  the  level  of  reliability  they 
provide. 

6.  It would be worthwhile to consider the arguments for and
against front-ending and try to determine under what circumstances
front-ending will be advantageous.